
[Models] Add SharedFusedMoE support to Qwen3MoE#32082

Merged
ywang96 merged 11 commits into vllm-project:main from Isotr0py:qwen3moe-share-expert
Jan 24, 2026

Conversation

@Isotr0py
Member

@Isotr0py Isotr0py commented Jan 10, 2026

Purpose

  • Fix [Perf] Use vLLM's SharedFusedMoE in Qwen3-Omni vllm-omni#560 (comment)
  • Qwen3-Omni's MoE talker has shared experts in its sparse MoE block, while vLLM's Qwen3MoE implementation assumes there are no shared experts inside the sparse MoE block (a rough sketch of this structure follows the list). As a result, vLLM-omni had to carry a duplicate implementation of the Qwen3MoE sparse MoE block.
  • This PR upstreams that implementation to vLLM to avoid the duplication.
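
For readers unfamiliar with the layout, below is a minimal, self-contained PyTorch sketch of a sparse MoE block that also carries a shared expert. All class and attribute names here (`TinyExpert`, `SparseMoEWithSharedExpert`, `shared_expert_gate`, ...) are illustrative assumptions; this is not the vLLM or vllm-omni code, only the dataflow it supports.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyExpert(nn.Module):
    """A stand-in MLP expert: down(silu(up(x)))."""

    def __init__(self, hidden: int, intermediate: int):
        super().__init__()
        self.up = nn.Linear(hidden, intermediate, bias=False)
        self.down = nn.Linear(intermediate, hidden, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.up(x)))


class SparseMoEWithSharedExpert(nn.Module):
    """Routed top-k experts plus one shared expert applied to every token."""

    def __init__(self, hidden: int, intermediate: int, num_experts: int, top_k: int):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden, num_experts, bias=False)       # router logits
        self.experts = nn.ModuleList(TinyExpert(hidden, intermediate) for _ in range(num_experts))
        self.shared_expert = TinyExpert(hidden, intermediate)        # runs on all tokens
        self.shared_expert_gate = nn.Linear(hidden, 1, bias=False)   # sigmoid-gated

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, hidden)
        weights, topk_ids = torch.topk(F.softmax(self.gate(x), dim=-1), self.top_k, dim=-1)
        routed_rows = []
        for ids, w, token_x in zip(topk_ids, weights, x):
            routed_rows.append(sum(weight * self.experts[int(expert_id)](token_x)
                                   for expert_id, weight in zip(ids, w)))
        routed = torch.stack(routed_rows)
        shared = torch.sigmoid(self.shared_expert_gate(x)) * self.shared_expert(x)
        return routed + shared  # shared-expert output is summed into the routed output


if __name__ == "__main__":
    block = SparseMoEWithSharedExpert(hidden=16, intermediate=32, num_experts=4, top_k=2)
    print(block(torch.randn(3, 16)).shape)  # torch.Size([3, 16])
```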

Test Plan

vLLM side, with:

python examples/offline_inference/vision_language.py -m qwen3_vl

vllm-omni side with https://github.com/Isotr0py/vllm-omni/tree/check-qwen3-omni-moe-talker:

python examples/offline_inference/qwen3_omni/end2end.py

Test Result

Tested with Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 at tp_size=2:

INFO 01-12 00:14:59 [utils.py:254] non-default args: {'max_model_len': 4096, 'tensor_parallel_size': 2, 'max_num_seqs': 5, 'limit_mm_per_prompt': {'image': 1, 'video': 0, 'audio': 0}, 'mm_processor_kwargs': {'min_pixels': 784, 'max_pixels': 1003520, 'fps': 1}, 'model': '/home/mozf/LLM/Qwen3-VL-30B-A3B-Instruct-FP8'}
INFO 01-12 00:14:59 [model.py:528] Resolved architecture: Qwen3VLMoeForConditionalGeneration
...
(EngineCore_DP0 pid=3767062) INFO 01-12 00:17:25 [kv_cache_utils.py:1305] GPU KV cache size: 98,176 tokens
(EngineCore_DP0 pid=3767062) INFO 01-12 00:17:25 [kv_cache_utils.py:1310] Maximum concurrency for 4,096 tokens per request: 23.97x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  8.36it/s]
Capturing CUDA graphs (decode, FULL): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00,  8.06it/s]
(EngineCore_DP0 pid=3767062) (Worker_TP0 pid=3767068) INFO 01-12 00:17:27 [gpu_model_runner.py:4837] Graph capturing finished in 2 secs, took 0.19 GiB
(EngineCore_DP0 pid=3767062) INFO 01-12 00:17:28 [core.py:273] init engine (profile, create kv cache, warmup model) took 112.92 seconds
(EngineCore_DP0 pid=3767062) INFO 01-12 00:17:31 [core.py:186] Batch queue is enabled with size 2
INFO 01-12 00:17:31 [llm.py:344] Supported tasks: ['generate']
Adding requests: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:03<00:00,  1.17it/s]
Processed prompts: 100%|███████████████████████████████████████████████████████████████| 4/4 [00:02<00:00,  1.66it/s, est. speed input: 1621.29 toks/s, output: 105.99 toks/s]
--------------------------------------------------
This image captures a beautiful spring scene in Japan, featuring the Tokyo Skytree framed by blooming cherry blossoms. The photograph is taken from a low angle, looking up through the branches of a cherry blossom tree, with its pink flowers in full bloom creating a delicate, natural frame around the iconic tower. The clear blue
--------------------------------------------------
This image captures a beautiful spring scene in Japan, featuring the Tokyo Skytree framed by blooming cherry blossoms (sakura). The photo is taken from a low angle, looking up through the branches of a cherry blossom tree, with its pink flowers in the foreground creating a natural, delicate frame around the iconic tower
--------------------------------------------------
This image captures a beautiful spring scene in Japan, featuring the Tokyo Skytree framed by blooming cherry blossoms. The photograph is taken from a low angle, looking up through the branches of a cherry blossom tree, with its delicate pink flowers in the foreground. The iconic white tower of the Tokyo Skytree rises in the
--------------------------------------------------
This image captures a beautiful spring scene in Japan, featuring the Tokyo Skytree framed by blooming cherry blossoms. The photograph is taken from a low angle, looking up through the branches of a cherry blossom tree, which creates a natural frame around the iconic tower. The vibrant pink flowers contrast beautifully with the clear blue sky
--------------------------------------------------
  • vllm-omni generates reasonable audio outputs with this implementation.

Note

Introduces shared-expert support in Qwen3MoE via SharedFusedMoE and optional shared MLP gating.

  • Replace FusedMoE with SharedFusedMoE in sparse MoE block; add gate, optional shared_expert_gate, and shared_expert MLP (controlled by shared_expert_intermediate_size)
  • Update forward to compute shared_out and fused_out, sum when present, and perform TP all-reduce when not sequence-parallel; set reduce_results=False
  • Extend Qwen3MoeMLP to accept expert_gate and apply sigmoid gating to outputs
  • Use SharedFusedMoE.make_expert_params_mapping for expert weight loading

Written by Cursor Bugbot for commit f4ebf48.
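
To make the reduction strategy in this note concrete, here is a hedged sketch of summing the shared and routed outputs and deferring to a single all-reduce. It uses plain torch.distributed rather than vLLM's tensor-parallel helpers, and `combine_moe_outputs` and its arguments are hypothetical names, not the actual diff.

```python
import torch
import torch.distributed as dist


def combine_moe_outputs(
    shared_out: torch.Tensor | None,
    fused_out: torch.Tensor,
    tp_size: int,
    tp_group: "dist.ProcessGroup | None" = None,
) -> torch.Tensor:
    """Sum the (partial) shared-expert and routed-expert outputs, then reduce once."""
    out = fused_out if shared_out is None else fused_out + shared_out
    # With reduce_results=False each branch holds only a partial sum under tensor
    # parallelism, so one all-reduce over the combined output suffices (skip for TP=1).
    if tp_size > 1:
        dist.all_reduce(out, group=tp_group)
    return out


if __name__ == "__main__":
    shared = torch.randn(4, 8)
    fused = torch.randn(4, 8)
    print(combine_moe_outputs(shared, fused, tp_size=1).shape)  # torch.Size([4, 8])
```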

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
@mergify mergify Bot added the qwen (Related to Qwen models) label Jan 10, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request adds support for SharedFusedMoE to Qwen3MoE, which is a valuable enhancement for handling models with shared experts. The overall approach is sound, but I've identified two critical issues that would lead to runtime errors. One is an incorrect method call with an undefined argument, and the other is improper handling of the return value from SharedFusedMoE.forward. Please see my detailed comments for suggestions on how to fix these issues.

Comment thread vllm/model_executor/models/qwen3_moe.py
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
@Isotr0py Isotr0py marked this pull request as ready for review January 11, 2026 16:28
@Isotr0py Isotr0py requested a review from sighingnow as a code owner January 11, 2026 16:28
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
@Isotr0py
Member Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request adds support for shared experts in Qwen3MoE by integrating SharedFusedMoE. The changes correctly set up the shared expert and its gating mechanism. However, I've found a critical issue in the tensor parallelism logic where the final output is not correctly reduced across tensor parallel ranks, which will lead to incorrect model outputs when tp_size > 1. I've provided a fix for this issue.

Comment thread vllm/model_executor/models/qwen3_moe.py
@Isotr0py Isotr0py added the ready (ONLY add when PR is ready to merge/full CI is needed) label Jan 12, 2026
hidden_act: str,
quant_config: QuantizationConfig | None = None,
reduce_results: bool = True,
expert_gate: torch.nn.Linear | None = None,


Incorrect type hint causes wrong tensor indexing

Low Severity

The expert_gate parameter is typed as torch.nn.Linear | None but is actually a ReplicatedLinear. The code does self.expert_gate(x)[0] to extract the output from a tuple returned by ReplicatedLinear.forward. However, torch.nn.Linear.forward returns a tensor directly, not a tuple, so [0] would incorrectly index into the first dimension of the tensor instead of extracting the output. While the current code works because only ReplicatedLinear is passed, the type annotation is misleading and using torch.nn.Linear as documented would produce silently incorrect results.

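A tiny, self-contained illustration of the indexing difference flagged here. `ReplicatedLinearLike` is a hypothetical stand-in that mimics the tuple-returning forward described in the comment; it is not vLLM's ReplicatedLinear.

```python
import torch
import torch.nn as nn


class ReplicatedLinearLike(nn.Linear):
    def forward(self, x: torch.Tensor):
        # vLLM-style linear layers return (output, output_bias), not just the output.
        return super().forward(x), None


x = torch.randn(4, 8)

gate_tuple = ReplicatedLinearLike(8, 1, bias=False)
out_ok = gate_tuple(x)[0]       # [0] unpacks the tuple -> shape (4, 1), as intended

gate_plain = nn.Linear(8, 1, bias=False)
out_wrong = gate_plain(x)[0]    # [0] indexes the first row of the tensor -> shape (1,)

print(out_ok.shape, out_wrong.shape)  # torch.Size([4, 1]) torch.Size([1])
```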

Copy link
Copy Markdown
Contributor

@gcanlin gcanlin left a comment


LGTM, thanks! cc @ywang96.

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
@mergify
Contributor

mergify Bot commented Jan 22, 2026

Hi @Isotr0py, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
@ywang96 ywang96 merged commit 8edaf38 into vllm-project:main Jan 24, 2026
53 checks passed
@Isotr0py Isotr0py deleted the qwen3moe-share-expert branch January 24, 2026 07:36
cwazai pushed a commit to cwazai/vllm that referenced this pull request Jan 25, 2026
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: 陈建华 <1647430658@qq.com>
lapy pushed a commit to lapy/vllm that referenced this pull request Jan 27, 2026
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
wangxiyuan pushed a commit to vllm-project/vllm-ascend that referenced this pull request Jan 29, 2026
…#6335)

### What this PR does / why we need it?
PR vllm-project/vllm#32082 in vLLM makes Qwen3-MoE models also go through `SharedFusedMoE`, while the current implementation of our `AscendSharedFusedMoE` assumes `shared_experts` always exists. This PR adds checks to `multistream_overlap_shared_expert` and `multistream_overlap_gate` so that these features are only enabled when shared experts exist.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
All CI passed

- vLLM version: v0.14.1
- vLLM main:
vllm-project/vllm@dc917cc

Signed-off-by: whx-sjtu <2952154980@qq.com>
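
As a rough sketch of the guard this commit message describes: only the flag names (`shared_experts`, `multistream_overlap_shared_expert`, `multistream_overlap_gate`) follow the message; the class and constructor below are hypothetical, not the actual AscendSharedFusedMoE code.

```python
class SharedFusedMoEGuardExample:
    def __init__(self, shared_experts, enable_overlap_shared_expert: bool, enable_overlap_gate: bool):
        self.shared_experts = shared_experts  # may be None for models without shared experts
        # Only enable the multistream-overlap features when a shared expert actually exists.
        has_shared = shared_experts is not None
        self.multistream_overlap_shared_expert = enable_overlap_shared_expert and has_shared
        self.multistream_overlap_gate = enable_overlap_gate and has_shared


if __name__ == "__main__":
    moe = SharedFusedMoEGuardExample(shared_experts=None,
                                     enable_overlap_shared_expert=True,
                                     enable_overlap_gate=True)
    print(moe.multistream_overlap_shared_expert, moe.multistream_overlap_gate)  # False False
```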
starmountain1997 pushed a commit to starmountain1997/vllm-ascend that referenced this pull request Jan 31, 2026
wangxiyuan added a commit to vllm-project/vllm-ascend that referenced this pull request Feb 2, 2026
### What this PR does / why we need it?
This PR upgrades the vLLM dependency from `v0.14.1` to `v0.15.0`. This
involves:
- Updating the `VLLM_TAG` in all `Dockerfile`.
- Updating the vLLM version in `docs/source/conf.py`.
- Removing conditional code paths specific to `v0.14.1` across the
codebase, which simplifies maintenance.
- Fix `TypeError: MMEncoderAttention.__init__() got an unexpected
keyword argument 'multimodal_config'` due to
vllm-project/vllm#31972.
- Fix `_shared_experts: 'NoneType' object is not callable` due to
vllm-project/vllm#32082 by
#6335.
- Fix `ReshapeAndCacheOperation setup failed!` due to
vllm-project/vllm#25954 by overriding attention
metadata slots.

This upgrade is necessary to keep the project aligned with the latest
features, bug fixes, and API changes in the vLLM project.

### Does this PR introduce _any_ user-facing change?
No, this is an internal dependency update and does not introduce any
user-facing changes.

### How was this patch tested?
CI is expected to pass with these changes, ensuring that all existing
tests are successful with the new vLLM version.

- vLLM version: v0.14.1
- vLLM main:
vllm-project/vllm@dc917cc


co-authored-by: shen-shanshan <467638484@qq.com>

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Tflowers-0129 pushed a commit to Tflowers-0129/vllm-ascend that referenced this pull request Feb 3, 2026
chenchuw886 pushed a commit to chenchuw886/vllm-ascend that referenced this pull request Feb 12, 2026
chenchuw886 pushed a commit to chenchuw886/vllm-ascend that referenced this pull request Feb 12, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Feb 28, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Feb 28, 2026
maoxx241 pushed a commit to maoxx241/vllm-ascend that referenced this pull request Mar 2, 2026
maoxx241 pushed a commit to maoxx241/vllm-ascend that referenced this pull request Mar 2, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Mar 4, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Mar 4, 2026
LCAIZJ pushed a commit to LCAIZJ/vllm-ascend that referenced this pull request Mar 7, 2026
LCAIZJ pushed a commit to LCAIZJ/vllm-ascend that referenced this pull request Mar 7, 2026
jiangyunfan1 pushed a commit to jiangyunfan1/vllm-ascend that referenced this pull request Apr 9, 2026
jiangyunfan1 pushed a commit to jiangyunfan1/vllm-ascend that referenced this pull request Apr 9, 2026